A comparison of four language models for large vocabulary Turkish speech recognition
Abstract
This paper compares three language models proposed as alternatives to the word-based language model for large vocabulary speech recognition of Turkish. Turkish is an agglutinative language with productive morphology. This results in a huge vocabulary size and a large number of out-of-vocabulary words for unseen test data. The solution is to parse the words into smaller base units that can cover the language with a relatively small vocabulary. Three different ways of decomposing words into base units are described: a morpheme-based model, a stem-ending-based model and a syllable-based model. These models are compared with respect to vocabulary size, coverage, number of out-of-vocabulary words, perplexity and sensitivity to context. For all three models, a significant improvement in those measures is observed compared to the word-based model.
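The effect of sub-word units on out-of-vocabulary rate can be illustrated with a small sketch. The code below is not the paper's method or data: the `syllabify` rule is a simplified approximation of Turkish syllabification (one vowel per syllable, with a single preceding consonant opening the syllable), and the function names and the toy word lists are assumptions chosen only for illustration.

```python
# Toy comparison of word units vs. syllable units (illustrative only).
VOWELS = set("aeıioöuü")


def syllabify(word):
    """Split a lowercase Turkish word into syllables with a simplified rule:
    every syllable holds exactly one vowel, and a consonant directly
    before a vowel starts that vowel's syllable."""
    vowel_idx = [i for i, ch in enumerate(word) if ch in VOWELS]
    if len(vowel_idx) < 2:
        return [word]
    cuts = []
    for v in vowel_idx[1:]:
        # Begin the syllable at the preceding consonant, if there is one.
        cuts.append(v - 1 if word[v - 1] not in VOWELS else v)
    bounds = [0] + cuts + [len(word)]
    return [word[a:b] for a, b in zip(bounds, bounds[1:])]


def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training tokens."""
    vocab = set(train_tokens)
    missing = sum(1 for tok in test_tokens if tok not in vocab)
    return missing / len(test_tokens)


# Hypothetical toy data, not taken from the paper.
train_words = ["evler", "evde", "kitaplar", "kitapta", "okulda"]
test_words = ["evlerde", "kitaplarda"]

train_syllables = [s for w in train_words for s in syllabify(w)]
test_syllables = [s for w in test_words for s in syllabify(w)]

word_oov = oov_rate(train_words, test_words)
syllable_oov = oov_rate(train_syllables, test_syllables)
print("word OOV rate:", word_oov)
print("syllable OOV rate:", syllable_oov)
```

In this toy setup both test words are unseen as whole words, yet all of their syllables occur in the training words, which is the kind of coverage gain the abstract attributes to smaller base units. Real corpora would also need perplexity measurements, which this sketch does not attempt.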
Similar resources
Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting
Islamic Republic of Iran Broadcasting (IRIB), as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, the IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieval of this archive, is one of the key issues for this broadcasting...
A unified language model for large vocabulary continuous speech recognition of Turkish
We have designed a Turkish dictation system for a newspaper content transcription application. Turkish is an agglutinative language with free word order. These characteristics of the language result in vocabulary explosion, a large number of out-of-vocabulary (OOV) words and an increased complexity of n-gram language models in speech recognition when words are used as recognition units. In this pap...
Analysis of Morph-Based Speech Recognition and the Modeling of Out-of-Vocabulary Words Across Languages
We analyze subword-based language models (LMs) in large-vocabulary continuous speech recognition across four “morphologically rich” languages: Finnish, Estonian, Turkish, and Egyptian Colloquial Arabic. By estimating n-gram LMs over sequences of morphs instead of words, better vocabulary coverage and reduced data sparsity are obtained. Standard word LMs suffer from high out-of-vocabulary (OOV) r...
Language independent and language adaptive large vocabulary speech recognition
This paper describes the design of a multilingual speech recognizer using an LVCSR dictation database which has been collected under the project GlobalPhone. This project at the University of Karlsruhe investigates LVCSR systems in 15 languages of the world, namely Arabic, Chinese, Croatian, English, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Swedish, Tamil, and Tu...
Lattice extension and rescoring based approaches for LVCSR of Turkish
In this paper, we present some techniques to solve the problems of Turkish Large Vocabulary Continuous Speech Recognition (LVCSR). Its agglutinative nature makes Turkish a challenging language in terms of speech recognition since it is impossible to include all possible words in the recognition lexicon. Therefore, data-driven sub-word recognition units, in addition to words, are used in a newsp...